.. _explain-function - Many to One Explainer:

Many to One Explainer
=====================
The **Many to One Explainer** creates rule-based explanations for many-to-one relationships.
It provides insights into how the input features define groups of output features.

Method Signature
-----------------------------------
.. code-block:: python

   ExpDataFrame.explain(
       explainer: Literal['fedex', 'outlier', 'many_to_one', 'shapley', 'metainsight'] = 'fedex',
       attributes: List = None,
       use_sampling: bool | None = None,
       sample_size: int | float = 5000,
       labels=None,
       coverage_threshold: float = 0.7,
       max_explanation_length: int = 3,
       separation_threshold: float = 0.3,
       p_value: int = 1,
       explanation_form: Literal['conj', 'disj', 'conjunction', 'disjunction'] = 'conj',
       prune_if_too_many_labels: bool = True,
       max_labels: int = 10,
       pruning_method='largest',
       bin_numeric: bool = False,
       num_bins: int = 10,
       binning_method: str = 'quantile',
       label_name: str = 'label',
       explain_errors=True,
       error_explanation_threshold: float = 0.05,
   )

Many to One Explainer Usage Example
-----------------------------------
.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the "adult" dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Call the many to one explainer
    adult.explain(explainer='many_to_one', labels='class')

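**Output**:

.. table::

    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
    | Group / Cluster | Explanation                                              | Coverage | Separation Error | Separation Error Origins |
    +=================+==========================================================+==========+==================+==========================+
    | <=50K           | 1 <= education-num <= 10                                 | 0.75     | 0.15             | 100.00% from group >50K  |
    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
    | <=50K           | 0 <= capital-gain <= 5095.5                              | 1.0      | 0.21             | 100.00% from group >50K  |
    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
    | <=50K           | 0 <= capital-gain <= 5095.5 AND 1 <= education-num <= 10 | 0.75     | 0.13             | 100.00% from group >50K  |
    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
    | <=50K           | 0 <= capital-gain <= 4243.5                              | 0.99     | 0.2              | 100.00% from group >50K  |
    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+
    | >50K            | No explanation found                                     | NaN      | NaN              | NaN                      |
    +-----------------+----------------------------------------------------------+----------+------------------+--------------------------+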

Coverage is the fraction of the group that is covered by the explanation, reported as a value between 0 and 1.
Separation Error is the fraction of data outside the group that is covered by the explanation, reported on the same scale.
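
As a rough illustration of these metrics, the sketch below recomputes them by hand with plain pandas on a tiny, made-up DataFrame.
The data, the rule, and the helper name ``rule_metrics`` are assumptions made for this example and are not part of pd_explain.
It also assumes that the separation error is measured relative to the rows matched by the rule, which may differ from the library's exact computation.

.. code-block:: python

    import pandas as pd

    # Toy data (made up for illustration): one numeric feature and a label column.
    df = pd.DataFrame({
        "education-num": [5, 9, 10, 13, 7, 14, 9, 16],
        "class": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
    })

    def rule_metrics(data, rule_mask, label_col, group):
        """Hypothetical helper: score a boolean rule mask against one group."""
        in_group = data[label_col] == group
        # Coverage: share of the group's rows that the rule matches.
        coverage = (rule_mask & in_group).sum() / in_group.sum()
        # Separation error (assumed reading): share of matched rows outside the group.
        wrongly_covered = rule_mask & ~in_group
        separation_error = wrongly_covered.sum() / rule_mask.sum()
        # Origins: which other groups the wrongly matched rows come from.
        origins = data.loc[wrongly_covered, label_col].value_counts(normalize=True)
        return coverage, separation_error, origins

    # The rule "1 <= education-num <= 10" as a boolean mask.
    rule = df["education-num"].between(1, 10)
    coverage, separation_error, origins = rule_metrics(df, rule, "class", "<=50K")

    print(coverage, separation_error)
    print(origins)

In this toy data the rule matches four of the five ``<=50K`` rows and one ``>50K`` row, so all of its separation error originates from the ``>50K`` group.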

Parameters
-----------------------------------
- ``explainer`` (str): The explainer to use. This is shared with other explainers, but for the many to one explainer, it must be set to ``many_to_one``.
- ``attributes`` (list, optional): The attributes to consider when generating explanations. Default is ``None``.
- ``use_sampling`` (bool | None, optional): Whether to use sampling to speed up the computation. Default is to use the global setting.
- ``sample_size`` (int | float, optional): The number of samples to use. Default is ``5000``. Using a float between ``0`` and ``1`` will use that fraction of the data.
- ``labels`` (str | list | Series | DataFrame | ndarray | None): The labels defining the many to one relationship. Can be the name (or a list of names) of a column in the DataFrame, a Series, a DataFrame, a numpy array, or ``None``. ``None`` is only applicable when the explainer is called on the result of a GroupBy operation, in which case the groups are inferred automatically. Otherwise, the labels must be provided. Default is ``None``.
- ``coverage_threshold`` (float, optional): The minimum coverage required for an explanation to be considered. Default is ``0.7``.
- ``separation_threshold`` (float, optional): The maximum separation error allowed for an explanation to be considered. Default is ``0.3``.
- ``max_explanation_length`` (int, optional): The maximum number of conditions in an explanation. Default is ``3``.
- ``p_value`` (float, optional): A scaling parameter for the number of top attributes to consider when generating explanations. Number of attributes to consider = ``p_value`` * ``max_explanation_length``. Default is ``1``.
- ``explanation_form`` (str, optional): The form of the explanation. Default is ``conj`` (conjunction). The other options are ``disj`` (disjunction) and the long forms ``conjunction`` and ``disjunction``.
- ``prune_if_too_many_labels`` (bool, optional): Whether to prune the labels to a smaller subset if there are too many. Default is ``True``.
- ``max_labels`` (int, optional): The number of labels to keep if ``prune_if_too_many_labels`` is ``True``. If there are fewer labels, no pruning will be performed. Default is ``10``.
- ``pruning_method`` (str, optional): The method to use for pruning labels. Default is ``largest``. The options are (with k = ``max_labels``):

    - ``largest``: Keeps the k most frequent labels.
    - ``smallest``: Keeps the k least frequent labels.
    - ``random``: Keeps k random labels.
    - ``max_dist``: Keeps the k labels with the largest mean distance between their centroids and the centroids of other labels.
    - ``min_dist``: Keeps the k labels with the smallest mean distance between their centroids and the centroids of other labels.
    - ``max_silhouette``: Keeps the k labels with the largest silhouette score.
    - ``min_silhouette``: Keeps the k labels with the smallest silhouette score.

- ``bin_numeric`` (bool, optional): If the labels are numeric, whether to bin them into categories. Default is ``False``.
- ``num_bins`` (int, optional): The number of bins to use if ``bin_numeric`` is ``True``. If there are fewer unique values than ``num_bins``, no binning will be performed. Default is ``10``.
- ``binning_method`` (str, optional): The method to use for binning. Default is ``quantile``. The options are:

    - ``uniform``: Bins are of equal width.
    - ``quantile``: Bins are of equal frequency.

- ``label_name`` (str, optional): The name to give the labels if they are binned. Default is ``label``. Only needed if the labels do not come from a Series / DataFrame with a name, and it only affects how the labels are displayed in the explanation. For example, you may see ``x <= label <= y`` as a group name.
- ``explain_errors`` (bool, optional): Whether to provide explanations for the origins of the separation error. Default is ``True``.
- ``error_explanation_threshold`` (float, optional): The threshold for how much a group must individually contribute to the separation error to appear in the explanation. Groups that contribute less than this will be grouped together. Default is ``0.05``.
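
To make the frequency-based pruning options more concrete, here is a minimal sketch of ``largest``, ``smallest`` and ``random`` pruning written with plain pandas.
The function ``prune_labels`` and its ``seed`` argument are hypothetical names introduced for this example; the sketch assumes that rows belonging to pruned labels are simply dropped, which may not match what pd_explain does internally.

.. code-block:: python

    import pandas as pd

    def prune_labels(labels: pd.Series, max_labels: int,
                     method: str = "largest", seed: int = 0) -> pd.Series:
        """Hypothetical helper: keep only the rows of the selected labels."""
        counts = labels.value_counts()
        # Nothing to prune if there are already few enough labels.
        if len(counts) <= max_labels:
            return labels
        if method == "largest":
            keep = counts.nlargest(max_labels).index
        elif method == "smallest":
            keep = counts.nsmallest(max_labels).index
        elif method == "random":
            keep = counts.sample(n=max_labels, random_state=seed).index
        else:
            raise ValueError(f"Unsupported method in this sketch: {method}")
        return labels[labels.isin(keep)]

    # Toy labels: 'a' appears 4 times, 'b' 3 times, 'c' twice, 'd' once.
    toy_labels = pd.Series(list("aaaabbbccd"))
    kept_largest = prune_labels(toy_labels, max_labels=2, method="largest")
    kept_smallest = prune_labels(toy_labels, max_labels=2, method="smallest")

    print(sorted(kept_largest.unique()))
    print(sorted(kept_smallest.unique()))

The distance-based and silhouette-based methods need the feature data as well as the labels, so they are left out of this sketch.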

Other Usage Examples
--------------------
We will now show other examples of how to use the **many to one explainer** with different parameters.

Example 1: Explaining Clustering Results
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The **many to one explainer** works on any many-to-one relationship, including clustering results.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain
    from sklearn.cluster import KMeans

    # Load the adult dataset
    adult = pd.read_csv(r'C:\adult.csv')

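    # Perform a clustering operation on the numeric columns only,
    # since KMeans cannot be fitted on string-valued columns
    clusters = KMeans(n_clusters=3).fit_predict(adult.select_dtypes(include='number'))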

    # Call the many to one explainer
    adult.explain(explainer='many_to_one', labels=clusters)

**Output**:

.. table::

    +-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+
    | Group / Cluster | Explanation                                                    | Coverage | Separation Error | Separation Error Origins      |
    +=================+================================================================+==========+==================+===============================+
    | 0               | 149278.5 <= fnlwgt <= 1490400                                  | 1.0      | 0.22             | 100.00% from group 1          |
    +-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+
    | 0               | 149278.5 <= fnlwgt <= 1490400 AND 8.5 <= education-num <= 16.0 | 0.87     | 0.21             | 100.00% from group 1          |
    +-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+
    | 1               | 291277.5 <= fnlwgt <= 1490400                                  | 1.0      | 0.0              | Rule has no separation error. |
    +-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+
    | 2               | 13769 <= fnlwgt <= 149278.5                                    | 1.0      | 0.0              | Rule has no separation error. |
    +-----------------+----------------------------------------------------------------+----------+------------------+-------------------------------+


Example 2: Explaining GroupBy Groups
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If you perform a group-by operation, you can then call the many to one explainer on the result to get insights into the groups.
Simply leave the ``labels`` parameter as ``None`` to infer the groups from the DataFrame.
Note that ``labels`` can be left as ``None`` only for the result of a group-by operation; in any other case, you must provide the labels.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the adult dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Perform a group by operation
    gb_res = adult.groupby(['workclass', 'marital-status']).mean()

    # Call the many to one explainer, with some additional optional parameters to customize the output
    gb_res.explain(explainer='many_to_one', pruning_method='random', max_labels=3)

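**Output**:

.. table::

    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
    | Group / Cluster                             | Explanation                             | Coverage | Separation Error | Separation Error Origins                                                                                                    |
    +=============================================+=========================================+==========+==================+=============================================================================================================================+
    | (' Self-emp-inc', ' Separated')             | 26 <= age <= 69                         | 1.0      | 0.23             | 83.33% from group (' Self-emp-inc', ' Married-spouse-absent'), 16.67% from group (' Without-pay', ' Married-spouse-absent') |
    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
    | (' Self-emp-inc', ' Separated')             | occupation !=  Farming-fishing          | 0.95     | 0.17             | 100.00% from group (' Self-emp-inc', ' Married-spouse-absent')                                                              |
    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
    | (' Self-emp-inc', ' Married-spouse-absent') | sex !=  Female AND occupation ==  Sales | 0.8      | 0.0              | Rule has no separation error.                                                                                               |
    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
    | (' Self-emp-inc', ' Married-spouse-absent') | sex ==  Male AND occupation ==  Sales   | 0.8      | 0.0              | Rule has no separation error.                                                                                               |
    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+
    | (' Without-pay', ' Married-spouse-absent')  | age == 68                               | 1.0      | 0.0              | Rule has no separation error.                                                                                               |
    +---------------------------------------------+-----------------------------------------+----------+------------------+-----------------------------------------------------------------------------------------------------------------------------+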

Example 3: Disjunctive Explanations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The many to one explainer can provide explanations based on either conjunctive or disjunctive rules.
To get disjunctive explanations, set the ``explanation_form`` parameter to ``disj`` or ``disjunction``.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the adult dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Call the many to one explainer with disjunctive explanations.
    # We also consider only a set of categorical attributes, and disable
    # sampling for more accurate (but slower) results.
    adult.explain(explainer='many_to_one', explanation_form='disj', labels='class',
                  attributes=['workclass', 'education', 'marital-status', 'occupation', 'relationship'],
                  use_sampling=False)

**Output**:

.. table::

    +-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
    | Group / Cluster | Explanation                                            | Coverage | Separation Error | Separation Error Origins |
    +=================+========================================================+==========+==================+==========================+
    | <=50K           | occupation != Prof-specialty OR education != Bachelors | 0.96     | 0.23             | 100.00% from group >50K  |
    +-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
    | <=50K           | occupation != Prof-specialty                           | 0.91     | 0.21             | 100.00% from group >50K  |
    +-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
    | >50K            | No explanation found                                   | NaN      | NaN              | NaN                      |
    +-----------------+--------------------------------------------------------+----------+------------------+--------------------------+
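
The difference between the two forms can be sketched with boolean masks in plain pandas.
The toy DataFrame and the variable names below are made up for illustration; the sketch only shows how conditions combine and is not the explainer's own implementation.

.. code-block:: python

    from functools import reduce
    import operator

    import pandas as pd

    # Toy data (made up for illustration).
    df = pd.DataFrame({
        "occupation": ["Sales", "Prof-specialty", "Sales", "Prof-specialty", "Craft-repair"],
        "education": ["HS-grad", "Bachelors", "Bachelors", "HS-grad", "HS-grad"],
    })

    # Each condition of a rule is a boolean mask over the rows.
    conditions = [
        df["occupation"] != "Prof-specialty",
        df["education"] != "Bachelors",
    ]

    # Conjunction: a row matches only if ALL conditions hold.
    conj_mask = reduce(operator.and_, conditions)
    # Disjunction: a row matches if ANY condition holds.
    disj_mask = reduce(operator.or_, conditions)

    print(conj_mask.sum(), disj_mask.sum())

A disjunction matches at least as many rows as any one of its conditions, so adding conditions can only raise coverage, while a conjunction can only narrow the set of matched rows.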

Example 4: Passing a DataFrame as Labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
You can pass a DataFrame with more than one column as the labels, and not just a single column.
When you do so, each unique combination of the column values is treated as a separate label, much like in the case of a group-by operation.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the "adult" dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Select the labels
    labels = adult[['workclass', 'marital-status']]

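    # Call the many to one explainer on the remaining columns,
    # keeping at most 3 labels, chosen with the 'min_dist' pruning method
    adult.drop(columns=['workclass', 'marital-status']).explain(explainer='many_to_one', labels=labels, pruning_method='min_dist', max_labels=3)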

**Output**:

.. table::

    +---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
    | Group / Cluster                       | Explanation                                      | Coverage | Separation Error | Separation Error Origins                                                                               |
    +=======================================+==================================================+==========+==================+========================================================================================================+
    | ('State-gov', 'Never-married')        | relationship != Husband AND relationship != Wife | 1.0      | 0.05             | 85.71% from group ('?', 'Married-civ-spouse'), 14.29% from group ('Federal-gov', 'Married-civ-spouse') |
    +---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
    | ('Federal-gov', 'Married-civ-spouse') | occupation != ? AND relationship == Husband      | 0.91     | 0.0              | Rule has no separation error.                                                                          |
    +---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
    | ('?', 'Married-civ-spouse')           | occupation == ?                                  | 1.0      | 0.0              | Rule has no separation error.                                                                          |
    +---------------------------------------+--------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------------------------------+
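
To see how a multi-column DataFrame turns into a single set of labels, the following sketch uses plain pandas to assign one group to each unique combination of values.
The toy data and variable names are assumptions for this example; how pd_explain builds the combined labels internally is not shown here.

.. code-block:: python

    import pandas as pd

    # Toy two-column labels (made up for illustration).
    labels = pd.DataFrame({
        "workclass": ["State-gov", "Private", "State-gov", "Private"],
        "marital-status": ["Never-married", "Divorced", "Never-married", "Never-married"],
    })

    # Number of rows for each unique (workclass, marital-status) combination.
    group_sizes = labels.value_counts()

    # One integer id per row; rows with the same combination share an id.
    # With sort=False, ids follow the order in which combinations first appear.
    group_id = labels.groupby(list(labels.columns), sort=False).ngroup()

    print(group_sizes)
    print(group_id.tolist())

Here the four rows fall into three groups, because the first and third rows share the combination ``('State-gov', 'Never-married')``.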


Example 5: Binning Numeric Labels
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If your labels are numeric, you can bin them into categories to get more meaningful explanations.
To do this, set the ``bin_numeric`` parameter to ``True``, and optionally set the ``num_bins`` parameter to control the number of bins.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the "adult" dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Call the many to one explainer, setting the bin_numeric parameter to True, and using a custom number of bins
    adult.explain(explainer='many_to_one', labels='education-num', bin_numeric=True, num_bins=4)

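**Output**:

.. table::

    +------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | Group / Cluster              | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                       |
    +==============================+======================================================+==========+==================+================================================================================+
    | 0.999 < education-num <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 13.0 < label <= 16.0, 47.84% from group 10.0 < label <= 13.0 |
    +------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 9.0 < education-num <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                  |
    +------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 10.0 < education-num <= 13.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
    +------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 13.0 < education-num <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
    +------------------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+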
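
The group names above resemble the half-open intervals that pandas produces when binning.
As a sketch of the difference between equal-width (``uniform``) and equal-frequency (``quantile``) binning, the example below uses ``pd.cut`` and ``pd.qcut`` on made-up values; whether pd_explain relies on these exact functions is an assumption.

.. code-block:: python

    import pandas as pd

    # Toy numeric labels with one large outlier (made up for illustration).
    values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

    # Equal-width bins: the value range is split into 2 intervals of the same width,
    # so the outlier ends up alone in the upper bin.
    uniform_bins = pd.cut(values, bins=2)

    # Equal-frequency bins: the cut point is the median,
    # so both bins hold the same number of values.
    quantile_bins = pd.qcut(values, q=2)

    print(uniform_bins.value_counts().sort_index())
    print(quantile_bins.value_counts().sort_index())

With skewed labels, equal-frequency bins tend to give groups of comparable size, which may be one reason to prefer ``quantile`` in such cases.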


In this example, since the ``education-num`` column came from our DataFrame, it had a name to display.
Let's instead provide it as a numpy array and see how the output changes.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the "adult" dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Call the many to one explainer with the labels passed as a numpy array (which has no name),
    # dropping the label column from the DataFrame first
    adult.drop(columns='education-num').explain(explainer='many_to_one', labels=adult['education-num'].values, bin_numeric=True, num_bins=4)

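**Output**:

.. table::

    +----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | Group / Cluster      | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                       |
    +======================+======================================================+==========+==================+================================================================================+
    | 0.999 < label <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 12.0 < label <= 16.0, 47.84% from group 10.0 < label <= 12.0 |
    +----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 9.0 < label <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                  |
    +----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 10.0 < label <= 12.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
    +----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+
    | 12.0 < label <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                            |
    +----------------------+------------------------------------------------------+----------+------------------+--------------------------------------------------------------------------------+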

As you can see, the output now displays the label as ``label`` instead of ``education-num``.
If we want to change this, we can use the ``label_name`` parameter.

.. code-block:: python

    # Import the necessary libraries
    import pandas as pd
    import pd_explain

    # Load the "adult" dataset
    adult = pd.read_csv(r'C:\adult.csv')

    # Same call as before, now using label_name to set a custom display name for the labels
    adult.drop(columns='education-num').explain(explainer='many_to_one', labels=adult['education-num'].values, bin_numeric=True, num_bins=4, label_name='Education number')

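**Output**:

.. table::

    +---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
    | Group / Cluster                 | Explanation                                          | Coverage | Separation Error | Separation Error Origins                                                                             |
    +=================================+======================================================+==========+==================+======================================================================================================+
    | 0.999 < Education number <= 9.0 | education != Some-college AND education != Bachelors | 1.0      | 0.27             | 52.16% from group 12.0 < Education number <= 16.0, 47.84% from group 10.0 < Education number <= 12.0 |
    +---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
    | 9.0 < Education number <= 10.0  | education == Some-college                            | 1.0      | 0.0              | Rule has no separation error.                                                                        |
    +---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
    | 10.0 < Education number <= 12.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                                                  |
    +---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+
    | 12.0 < Education number <= 16.0 | No explanation found                                 | NaN      | NaN              | NaN                                                                                                  |
    +---------------------------------+------------------------------------------------------+----------+------------------+------------------------------------------------------------------------------------------------------+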